Introduction

The first official case of COVID-19 in the USA was confirmed on January 21, 2020. About three months later, almost 1 million cases had been recorded. In the context of this pandemic, it is of utmost importance to understand how the outbreak evolves by reporting data in a clear and insightful way.

This project has two main goals:

  • visualize different COVID-related metrics at the county level, using maps and time series: infection rate per 100k individuals, total cases, and total deaths

  • identify counties with potential errors in official counts: it has been shown that some counties have negative counts in cumulative cases, which is not possible.

Notes about the report:

  • The R code used to generate this report is provided: click the “Code” button on the right to display the code used in a given section.

  • Most graphs and tables are interactive: you can zoom in and out, click on elements to display more content, or search for specific data points.

Data preparation

Dependencies and input files

We first need to load R packages (mostly used for interactive visualizations).

Then we set the location of input files used in this project:

  • a web-scraped data file (april15.csv) containing COVID data for a single time point

  • official population counts in US counties (census), directly taken from the web

  • the New York Times COVID-19 data, taken from the web

  • official counties borders to be displayed on the map (downloaded from census.gov)

Getting and preparing NYT data

The other dataset we will look at is provided by the New York Times. We load it as a data frame containing the following variables: date, county, state, FIPS (county ID), cases, and deaths. We format the county ID just like before.
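As an illustration of this loading step, here is a minimal base R sketch. It is not the code behind the report: the inline sample rows are invented, and it assumes that "formatting the county ID" means storing FIPS as a zero-padded 5-character string, which is the usual convention but is not confirmed here.

```r
# Tiny inline sample in the same column layout as the NYT county file.
# The rows and values are invented for illustration only.
csv_text <- "date,county,state,fips,cases,deaths
2020-04-14,Autauga,Alabama,01001,19,1
2020-04-15,Autauga,Alabama,01001,23,1
2020-04-15,Cook,Illinois,17031,17306,630
2020-04-15,New York City,New York,,111424,7905"

# Read FIPS as character so that leading zeros are not dropped
nyt <- read.csv(text = csv_text,
                colClasses = c(fips = "character"),
                stringsAsFactors = FALSE)
nyt$date <- as.Date(nyt$date)

# Some rows have no FIPS code; mark them as missing
nyt$fips[nyt$fips == ""] <- NA

# Assumption: county IDs are 5-character, zero-padded strings
has_fips <- !is.na(nyt$fips)
nyt$fips[has_fips] <- sprintf("%05d", as.integer(nyt$fips[has_fips]))

str(nyt)
```

Keeping FIPS as text rather than as a number matters for later joins with census data and county borders, where codes such as 01001 would otherwise lose their leading zero.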

Compute rate per 100k using census data for both datasets

For the two COVID datasets, we have the number of cases and deaths. We can compute for each dataset two other metrics: the rate of cases per 100k inhabitants, and the rate of deaths per 100k.
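The computation can be sketched as follows, assuming each COVID dataset and the census table share a `fips` key and that the census table has a `population` column. These column names and all numbers below are hypothetical.

```r
# Hypothetical COVID counts for three counties (invented values)
covid <- data.frame(
  fips   = c("01001", "17031", "36067"),
  cases  = c(25, 17306, 400),
  deaths = c(1, 630, 10),
  stringsAsFactors = FALSE
)

# Hypothetical census populations (invented values)
census <- data.frame(
  fips       = c("01001", "17031", "36067"),
  population = c(50000, 5000000, 400000),
  stringsAsFactors = FALSE
)

# Left join: keep every COVID row even if the census entry is missing
merged <- merge(covid, census, by = "fips", all.x = TRUE)

# Rate per 100k inhabitants = count / population * 100,000
merged$cases_per_100k  <- merged$cases  / merged$population * 1e5
merged$deaths_per_100k <- merged$deaths / merged$population * 1e5

print(merged)
```

With `all.x = TRUE`, counties without a census match keep their counts and get `NA` rates instead of silently disappearing from the result.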

Maps based on web-scraped and NYT data

We want to represent our COVID data on a map of the USA. To do so, we generate maps using the Leaflet framework. The idea is to add polygons representing counties to the basemap. The color of each polygon depends on the metric of interest (here, total cases or rate per 100k). Clicking on a county displays more information about that area.
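A rough sketch of this idea with the `leaflet` and `sf` packages is shown below. It is only an illustration under several assumptions: two invented square "counties" stand in for the real census borders, the bin limits and the color palette are arbitrary, and the `square()` helper is hypothetical. The map is only built when both packages are installed.

```r
# Hypothetical helper: a closed square (lon/lat) standing in for a county border
square <- function(lon, lat, size = 1) {
  rbind(c(lon, lat), c(lon + size, lat), c(lon + size, lat + size),
        c(lon, lat + size), c(lon, lat))
}

# Invented county-level data
counties <- data.frame(
  name  = c("County A", "County B"),
  cases = c(120L, 4500L),
  stringsAsFactors = FALSE
)

# HTML text shown when a polygon is clicked
counties$popup <- sprintf("<b>%s</b><br/>Total cases: %d",
                          counties$name, counties$cases)

covid_map <- NULL
if (requireNamespace("leaflet", quietly = TRUE) &&
    requireNamespace("sf", quietly = TRUE)) {
  # Attach one polygon per county (WGS84 coordinates)
  geometry <- sf::st_sfc(
    sf::st_polygon(list(square(-90, 35))),
    sf::st_polygon(list(square(-88, 35))),
    crs = 4326
  )
  shapes <- sf::st_sf(counties, geometry = geometry)

  # Map the metric of interest to colors using arbitrary bins
  pal <- leaflet::colorBin("YlOrRd", domain = shapes$cases,
                           bins = c(0, 100, 1000, 10000))

  # Basemap tiles plus one colored, clickable polygon per county
  covid_map <- leaflet::addPolygons(
    leaflet::addTiles(leaflet::leaflet(shapes)),
    fillColor = ~pal(cases), fillOpacity = 0.7,
    color = "white", weight = 1,
    popup = ~popup
  )
}

covid_map  # prints the interactive map, or NULL if the packages are missing
```

Switching the map from total cases to the rate per 100k would only require changing the column passed to the palette and to `fillColor`.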

Total number of cases (April 15 data)

First, we represent the total number of cases in each county.

Rate per 100,000 (April 15 data)

Second, we represent the rate of cases per 100,000 individuals.

Prevalence (NYT data, last day available)

Time series representation based on the New York Times data

We can now look at COVID data over time. To do so, we represent the evolution of COVID cases for the 100 most affected counties (most affected = highest number of cases so far).
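Selecting the most affected counties could look like the following base R sketch. The data are invented, `n_top` is set to 2 instead of 100 to keep the example small, and "cases so far" is interpreted here as the maximum cumulative count reported for each county, which is an assumption.

```r
# Invented cumulative case counts for three counties over three days
nyt <- data.frame(
  date  = as.Date(rep(c("2020-04-13", "2020-04-14", "2020-04-15"), times = 3)),
  fips  = rep(c("01001", "17031", "36067"), each = 3),
  cases = c(15, 19, 23, 15474, 16323, 17306, 350, 380, 400),
  stringsAsFactors = FALSE
)

n_top <- 2  # the report uses 100

# Highest cumulative count reported so far in each county
totals <- aggregate(cases ~ fips, data = nyt, FUN = max)

# County IDs ranked from most to least affected, truncated to n_top
ranked   <- totals$fips[order(totals$cases, decreasing = TRUE)]
top_fips <- as.character(head(ranked, n_top))

# Time series restricted to the selected counties
top_series <- nyt[nyt$fips %in% top_fips, ]
print(top_series)
```

The resulting `top_series` data frame has one row per county and date, which is the long format most plotting libraries expect for drawing one line per county.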

Total cases over time

Rate of infection over time

Identifying counties with negative numbers
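Cumulative case counts should never decrease from one day to the next, so a negative day-to-day difference points to a correction or an error in the reported numbers. The sketch below shows one possible way to flag such counties in base R. It is not necessarily the method used in this report, and the county IDs and counts are invented.

```r
# Invented cumulative counts; county "00001" drops from 34 to 31 on day 3
nyt <- data.frame(
  date  = as.Date(rep(c("2020-04-12", "2020-04-13",
                        "2020-04-14", "2020-04-15"), times = 2)),
  fips  = rep(c("00001", "00002"), each = 4),
  cases = c(30, 34, 31, 36, 14585, 15474, 16323, 17306),
  stringsAsFactors = FALSE
)

# Differences are only meaningful in chronological order within a county
nyt <- nyt[order(nyt$fips, nyt$date), ]

# Day-to-day change per county; the first day has no predecessor, hence NA
nyt$daily_change <- ave(nyt$cases, nyt$fips,
                        FUN = function(x) c(NA, diff(x)))

# Rows where the cumulative count went down
negative <- nyt[!is.na(nyt$daily_change) & nyt$daily_change < 0, ]
print(negative)

# Counties with at least one decrease
flagged <- unique(negative$fips)
```

Sorting `negative` by `daily_change` would surface the largest discrepancies first.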

Data table

Time series for counties with largest discrepancies

##  [1] "Cullman 01043"     "Onondaga 36067"    "Tazewell 17179"   
##  [4] "Dougherty 13095"   "Carson City 32510" "Ripley 18137"     
##  [7] "Oakland 26125"     "Madison 01089"     "Lafayette 22055"  
## [10] "Rensselaer 36083"  "St. Charles 22089" "Tuscaloosa 01125" 
## [13] "Lexington 45063"   "St. Landry 22097"

Represent counties with discrepancies on a map

Work in progress.